Can Star Wars-related survey responses be used to predict whether a person earns more than $50,000 annually??
Data Loading & Missing Value Analysis
Show the code
# First 2 rows are column headers, created a map to make it easy to readnew_col_names = ["id", "seen_any", "fan_starwars", "seen_Ep_I", "seen_Ep_II", "seen_Ep_III", "seen_Ep_IV", "seen_Ep_V", "seen_Ep_VI", "rank_Ep_I", "rank_Ep_II", "rank_Ep_III", "rank_Ep_IV", "rank_Ep_V", "rank_Ep_VI", "fav_han", "fav_luke", "fav_leia", "fav_anakin", "fav_obi", "fav_palpatine", "fav_darth", "fav_lando", "fav_boba", "fav_c3po", "fav_r2", "fav_jar", "fav_padme", "fav_yoda", "who_shot_first", "familiar_expanded_universe", "fan_expanded_universe", "fan_startrek", "gender", "age", "income", "educ", "region"] # Read in the data to a pandas data-frame from CSV githubdf = pd.read_csv("https://github.com/fivethirtyeight/data/raw/master/star-wars-survey/StarWars.csv", encoding_errors="ignore", names = new_col_names, skiprows =2)# Replace common missing indicators with np.nandf = df.replace(['NA', 'N/A', 'na', 'n/a', '', 'NaN', 'nan', 999, -99, None], np.nan)# Display a glimpse of the datadisplay(df.head())
id
seen_any
fan_starwars
seen_Ep_I
seen_Ep_II
seen_Ep_III
seen_Ep_IV
seen_Ep_V
seen_Ep_VI
rank_Ep_I
...
fav_yoda
who_shot_first
familiar_expanded_universe
fan_expanded_universe
fan_startrek
gender
age
income
educ
region
0
3292879998
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
3.0
...
Very favorably
I don't understand this question
Yes
No
No
Male
18-29
NaN
High school degree
South Atlantic
1
3292879538
No
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
...
NaN
NaN
NaN
NaN
Yes
Male
18-29
$0 - $24,999
Bachelor degree
West South Central
2
3292765271
Yes
No
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
NaN
NaN
NaN
1.0
...
Unfamiliar (N/A)
I don't understand this question
No
NaN
No
Male
18-29
$0 - $24,999
High school degree
West North Central
3
3292763116
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
5.0
...
Very favorably
I don't understand this question
No
NaN
Yes
Male
18-29
$100,000 - $149,999
Some college or Associate degree
West North Central
4
3292731220
Yes
Yes
Star Wars: Episode I The Phantom Menace
Star Wars: Episode II Attack of the Clones
Star Wars: Episode III Revenge of the Sith
Star Wars: Episode IV A New Hope
Star Wars: Episode V The Empire Strikes Back
Star Wars: Episode VI Return of the Jedi
5.0
...
Somewhat favorably
Greedo
Yes
No
No
Male
18-29
$100,000 - $149,999
Some college or Associate degree
West North Central
5 rows × 38 columns
The dataset now loads cleanly into a DataFrame with 38 columns covering Star Wars movie viewership, rankings, favorite characters, and demographic information (gender, age, income, education, region).
From here, I’ll inspect the structure, data types, and missing values to prompt the direction of this project.
Missing Value Analysis
The columns with the highest percentage of missing values reveal clear patterns in survey behavior:
Show the code
# Get data types for internal usedatatypes = df.dtypes# Percent missing values of each columnmissing = (df.isna().sum() /len(df) *100).round(1)display(f"Top 10 Missing Values Percentages of Columns")display(missing.sort_values(ascending=False).head(10)) # Show top 10
Because of the extreme problems of missing values in columns, I will now filter on respondents who have at least seen one movie. I will also drop the fan_expanded_universe column as no insights can come from a column of mostly missing values.
Show the code
# Filtered to only include those who have inputted a movie they have seen, don't use seen_any column as many of those who answered 'yes', also left all of those blank when listing the movies they have seendf = df.dropna(subset=df.columns[2:8], how='all')# Remove all rows where income (target variable) is NAdf = df.dropna(subset=['income'])# Drop fan_expanded_universe columndf = df.drop(['fan_expanded_universe'], axis=1)# Select columns 3–8cols = df.columns[2:9]# Replace NaN with "no", everything else with "yes"df[cols] = df[cols].applymap(lambda x: 'yes'if pd.notna(x) else'no')
After the filtering out those who have not seen any star wars movie and those who did not list their income level, the total number of respondents went from 1186, down to 675. This is because it neglects the purpose of the project, which is to use star wars respondents to predict income level. It would not make sense to use peoples responses who have not at least seen one movie nor have listed their income.
One issue I forsee happening is the data is getting smaller and smaller, but still a a decent size. I essentially cut 57% of the respondents out from the survey.
Columns regarding on if the respondent is a fan, or if they have seen certain movies, also have a lot of NA’s. Further investigation reveals they only have one option to choose from, thus I will interpret blank answers as being they have not seen that movie.
Show the code
# Percent missing values of each column, filtered by those have have seen one movie at leastmissing = (df.isna().sum() /len(df) *100).round(1)display(f"Top 10 Missing Values Percentages of Columns")display(missing.sort_values(ascending=False).head(10)) # Show top 10
As you can now see, the data set is a lot cleaner when it comes to missing values.
The next step will be to examine the nature of the target variable and key predictors in order to determine which model(s) will be best suited for this project.
Exploratory Data Analysis
Show the code
# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))# Order income for graph mapincome_order = ['$0 - $24,999', '$25,000 - $49,999', '$50,000 - $99,999','$100,000 - $149,999','$150,000+']# Convert to ordered categoricaldf['income'] = pd.Categorical(df['income'], categories=income_order, ordered=True)display(ggplot(df, aes(x='income')) +\ geom_bar(fill="#4C72B0") +\ ggtitle("Income Distribution") +\ xlab("Income Category") +\ ylab("Number of Respondents") + theme(axis_text_x=element_text(angle=30, hjust=1)))
Count
Proportion
income
$50,000 - $99,999
238
0.35
$25,000 - $49,999
147
0.22
$100,000 - $149,999
115
0.17
$0 - $24,999
98
0.15
$150,000+
77
0.11
The income variable contains six categories. with the distribution showing two important points:
The 50,000 - 99,999 income level is decently biased, with about 37% of the respondents falling in. The rest of the income levels, range in between 12-23%. The variation of levels is slightly imbalanced, but we should be able to tweak some things depending on the model route I choose.
Since this is in the incorrect format for analysis, I will bin on 50k or more. We end up getting 64% of answers being 50k or more. We still have an imbalanced dataset.
Show the code
# Convert Income ranges to values 0 for less than 50kincome_map = {'$0 - $24,999': 0,'$25,000 - $49,999': 0,'$50,000 - $99,999': 1,'$100,000 - $149,999': 1,'$150,000+': 1}# Replacedf.loc[:,'income'] = df.loc[:,'income'].replace(income_map)# Grab value counts and Get proportion of column it representsincome_counts = df['income'].value_counts(dropna=False)income_props = df['income'].value_counts(normalize=True, dropna=False).round(2)# Display the counts and proportions as a tabledisplay(pd.DataFrame({'Count': income_counts, 'Proportion': income_props}))
Count
Proportion
income
1
430
0.64
0
245
0.36
I will now move on converting string-ordinal columns, to numeric-ordinal column for the purpose of visualizing, finding correlations, and preparing for modeling.
Show the code
# Convert age to numeric, picking the median of each, 66 for >60 because 72 is average life expectancyage_map = {'18-29': 23, '30-44': 37, '45-60': 52.5, '> 60': 66}# Replacedf['age'] = df['age'].map(age_map)# Preserve order in fav columnsfav_order = ['Very unfavorably','Somewhat unfavorably','Neither favorably nor unfavorably (neutral)','Unfamiliar (N/A)','Somewhat favorably','Very favorably']for col in df.columns[15:29]: df[col] = pd.Categorical(df[col], categories=fav_order, ordered=True)# Convert fav_ columns to number rankfav_map = {'Very favorably': 2,'Somewhat favorably': 1,'Neither favorably nor unfavorably (neutral)': 0,'Unfamiliar (N/A)': 0,'Somewhat unfavorably': -1,'Very unfavorably': -2}# Replacedf.iloc[:, list(range(15,29))] = df.iloc[:, list(range(15,29))].replace(fav_map)
Show the code
# Select favorability columns by name pattern and change dtype for correlation laterrank_cols = [col for col in df.columns if'rank_'in col]df[rank_cols] = df[rank_cols].astype('Int64')
Now, I will show the top correlated variables, strong or negative, with income.
Show the code
# Select all numeric columnsnumeric_cols = [col for col in df.columns if df[col].dtype in ['int64', 'int8', 'Int64', 'Int8', 'float64']]# Calculate correlations with incomecorrelations = df[numeric_cols + ['income']].corrwith(df['income'])correlations = correlations.drop('income').sort_values(key=abs, ascending=False)print("Top 5 correlations with income (>$50k):")display((correlations.head(5)))print("\nBottom 5 correlations:")display(correlations.tail(5))df['income'] = df['income'].astype('bool')
From the aforementioned correlations, we can see certain characters such as Luke, Leia, Obi-wan, and Han Solo, have around a 12-15% correlation with income.
Age, had somewhat to do as most would have guessed, with a 12% correlation with income.
The main standout from this, is no one variable is worth exploring as being a predictor of income by itself, but rather an interaction of multiple variables might be worthy pursuing.
In the coming graphs, I am looking for differences in the distributions between the levels of 2 variables, to assess if their is a multiple relationship between them and income. I will only display typical choices, and ones that are interesting.
Show the code
( ggplot(df, aes(x="age", fill="fav_leia")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
( ggplot(df, aes(x="age", fill="fav_obi")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
( ggplot(df, aes(x="age", fill="fav_han")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
( ggplot(df, aes(x="age", fill="educ")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
( ggplot(df, aes(x="age", fill="gender")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
( ggplot(df, aes(x="age", fill="who_shot_first")) + geom_bar(position="dodge") + facet_wrap("income") + labs( title="Star Wars vs Star Trek Fandom by Income", x="Age", y="Count", fill="Luke Fan? (2= very, 1= Somewhat" ) + theme(legend_position ="none"))
Show the code
top_regions = df['region'].value_counts().nlargest(6).indexdf2 = dfdf2['region_plot'] = np.where(df2['region'].isin(top_regions), df2['region'], 'Other')( ggplot(df2, aes(x="region_plot")) + geom_bar() + coord_flip() + facet_grid(y='income', x='gender') + labs(title="Region Distribution by Income and Gender", x="Region", y="Income"))
As I explored multi layered relationships, a few interesting things occured:
The interaction between gender, and education show differing distributions, perhaps uncovering a good predictor of income.
The interactions between age and education also added to a differing distribution.
All in all, the distributions are more similar than different across almost all interactions explored.
Therefore, I will entertain multiple tree models, including Random Forest, K-Nearest Neighbors, Gradient Boosting, Decision Tree, and NB booster. This project is limited by my knowledge, which only includes these ML algorithms as of today.
Model Prep
Show the code
# factorize into 1s 0s education columndf['educ'] = df['educ'].astype('category')df['educ'] = pd.factorize(df['educ'])[0]df['educ'] = df['educ'].astype('category')# Drop ID columndf1 = df.drop(df.columns[0], axis=1)# Separate columns to exclude from one-hot encodingencode_mask = df1.columns.str.contains(r'fav_|income|educ|age', case=False)new_encode = df1.loc[:, ~encode_mask].astype("category")old_encode = df1.loc[:, encode_mask]# One-hot encode the remaining categorical/object columnssw_encoded = pd.get_dummies(new_encode, dummy_na=True, prefix_sep='_', drop_first=False, dtype=int)sw_encoded.columns = sw_encoded.columns.str.replace(' ', '_')# Concatenate back the excluded columnssw = pd.concat([df.iloc[:,0], sw_encoded, old_encode], axis=1)# Drop rows with 7 or more nansw = sw[sw.isna().sum(axis=1) <=7]# Drop rows with nan in outcome variablesw = sw.dropna(subset=['income'])# drop ID columnsw = sw.drop('id', axis=1)
To begin, I transformed age into numeric midpoints and randomly assigned values between 60 and 80 for respondents who reported > 60. The income variable was recoded into a binary classification target, where:
\texttt{income} =
\begin{cases}
0 & \text{if income } < \$50{,}000 \\\\
1 & \text{if income } \geq \$50{,}000
\end{cases}
Favorability ratings for characters (columns prefixed with fav_) were mapped to a numerical scale from -2 (very unfavorable) to +2 (very favorable). I excluded high-cardinality columns from one-hot encoding (fav_, income, educ, and age), then encoded the remaining categorical variables using pd.get_dummies() with dummy_na=True. I then dropped rows with 7 or more missing values or any missing values in the target variable.
Show the code
display(f"{len(df1)} Observations")display(f"Percent of Observations Missing")display((df1.isna().sum() /len(df1) *100).sort_values(ascending=False))
After preprocessing, I assigned the predictors to x and the binary target variable to y = income. The dataset was split into training and test sets using a 75/25 ratio. Missing values in x were either filled with 0 (for models that require complete input like GradientBoostingClassifier and GaussianNB) or left as-is if the model could tolerate it.
I trained and evaluated the following classifiers:
RandomForestClassifier
GradientBoostingClassifier
DecisionTreeClassifier
GaussianNB (Naive Bayes)
Each model’s performance was evaluated using accuracy_score and classification_report from sklearn.metrics.
Show the code
# Assign Predictors and outcome variablesx = sw.drop(columns='income')y = sw['income'].astype(int)
Show the code
# Split dataset into training and testing setsx_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=50)# Split and fill na's for gradient/Gassianx_train_na = x_train.fillna(0)x_test_na = x_test.fillna(0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Random Forest achieved the highest overall accuracy at 69.0%, with strong recall (91%) and F1-score (0.81) for predicting individuals earning at least $50k. However, it performed very poorly on the lower-income class with just 22% f1 and 15% recall.
Gradient Boosting followed with an accuracy of 67.8%. It showed identical results to random forest as far as recall, precision, and f1 go. In essence, great at predicting the true cases, and terrible at the untrue cases.
** Decision Tree classifier follows the same pattern overall as Random Forest and Gradient Boosting Classifiers.
Gaussian Naive Bayes reached 28..8% accuracy but completely failed to classify high-income respondents, with 1% precision. It was heavily biased toward predicting class 2 only.
**K-Neighbors, follows a relatively similar pattern, with 67% accuracy and a little more balanced recall and f1 scores.
Conclusion and Recommendation
While Gradient Boosting offered a more balanced performance across both classes, I believe the Random Forest Classifier is the better choice in this context. Since the primary goal is to accurately predict whether someone earns more than $50,000, overall accuracy and precision for the high-income class matter more than recall for the low-income group. Random Forest achieved the highest accuracy and the strongest precision for class 1, making it more effective at identifying higher earners — which aligns directly with the objective of this analysis.
Feature Importance in Gradient Boosting Results
After training the RandomForestClassifier, I examined which features contributed most to the model’s predictions.
Using the feature_importances_ attribute of the trained model, I identified the top predictors for determining whether a respondent earns over $50,000. These features reflect a combination of demographic factors and Star Wars-related preferences that the model found most useful for predicting income level.
Here are the top 10 most important features:
Show the code
# Get top 10 features by importanceimportances = classifier_DT.feature_importances_features = x_train.columnsimportance_df = pd.DataFrame({'Feature': features, 'Importance': importances})top_features = importance_df.sort_values(by='Importance', ascending=False).head(10)ggplot(top_features, aes(x='Feature', y='Importance', fill='Feature')) +\ geom_bar(stat='identity', color='black', size=0.4) +\ ggtitle('Feature Importance - Random Forest') +\ xlab('Feature') +\ ylab('Importance') +\ theme_minimal() +\ theme( axis_text_x=element_text(angle=45, hjust=1, size=12), axis_text_y=element_text(size=12), plot_title=element_text(size=16, face='bold', hjust=0.5), legend_position='none' )
In the Random Forest model, feature importance was more evenly distributed across a mix of predictors. Among the features, age and educ were most important, which aligns with expected links between demographics and income. Several Star Wars favorability ratings — including fav_jar, fav_darth, fav_anakin, and fav_padme — also contributed meaningfully. This broader distribution of importance suggests the model’s draws on a variety of demographic and preference-based factors.
In conclusion, it seems as if star wars related preferences and education and age, are not very powerful in predicting income being greater than 50k or not.